17 research outputs found

    Self multi-head attention for speaker recognition

    Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task of the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.
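
    As a rough illustration of the pooling described above, here is a minimal numpy sketch of multi-head attention pooling over a sequence of CNN-encoded frame features; the scaled dot-product scoring, the head count and all variable names are assumptions rather than the paper's exact formulation.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention_pooling(H, w, n_heads=4):
        # H: (T, D) frame-level encoder outputs; w: (D,) learned attention parameters.
        # Each head attends over its own D // n_heads slice of the features, and the
        # per-head context vectors are concatenated into one utterance-level embedding.
        T, D = H.shape
        d = D // n_heads
        heads = []
        for h in range(n_heads):
            Hh = H[:, h * d:(h + 1) * d]           # (T, d) feature slice for this head
            wh = w[h * d:(h + 1) * d]              # (d,) attention vector for this head
            alpha = softmax(Hh @ wh / np.sqrt(d))  # (T,) alignment over the sequence
            heads.append(alpha @ Hh)               # (d,) attention-weighted average
        return np.concatenate(heads)               # (D,) speaker embedding

    # Hypothetical usage: pool 300 encoded frames of dimension 256 into one embedding.
    # H = np.random.randn(300, 256); w = np.random.randn(256)
    # emb = multi_head_attention_pooling(H, w)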

    Speaker characterization by means of attention pooling

    State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
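
    A hedged sketch of how the same pooled speaker vector might be reused for the other characterization tasks mentioned above, simply by swapping the final fully connected head; the head sizes, the softmax output and the assumption that only the head changes are illustrative, not taken from the paper.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def task_head(pooled, W, b):
        # pooled: (D,) output of the attention pooling layer;
        # W: (D, n_classes), b: (n_classes,) task-specific fully connected head.
        return softmax(pooled @ W + b)

    # Hypothetical heads sharing one front-end and pooling layer: emotion recognition
    # (e.g. 6 classes), sex classification (2 classes) and COVID-19 detection (2 classes).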

    Auto-encoding nearest neighbor i-vectors for speaker verification

    In the last years, i-vectors followed by cosine or PLDA scoring techniques were the state-of-the-art approach in speaker verification. PLDA requires labeled background data, and there exists a significant performance gap between the two scoring techniques. In this work, we propose to reduce this gap by using an autoencoder to transform i-vectors into a new speaker vector representation, which will be referred to as the ae-vector. The autoencoder will be trained to reconstruct neighbor i-vectors instead of the training i-vectors themselves, as is usual. These neighbor i-vectors will be selected in an unsupervised manner according to the highest cosine scores to the training i-vectors. The evaluation is performed on the speaker verification trials of the VoxCeleb-1 database. The experiments show that our proposed ae-vectors gain a relative improvement of 42% in terms of EER compared to the conventional i-vectors using cosine scoring, which fills the performance gap between the cosine and PLDA scoring techniques by 92%, but without using speaker labels.
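
    A minimal numpy sketch of the unsupervised neighbor selection the abstract describes: each training i-vector is paired with its highest-cosine neighbor, which then serves as the autoencoder's reconstruction target. The single-neighbor choice and the trainer call are assumptions.

    import numpy as np

    def neighbor_targets(ivectors):
        # ivectors: (N, D). For each i-vector, return the other i-vector with the
        # highest cosine score, to be used as the autoencoder's reconstruction target
        # (no speaker labels are needed for this selection).
        X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
        S = X @ X.T                    # pairwise cosine scores
        np.fill_diagonal(S, -np.inf)   # never pick the vector itself
        return ivectors[S.argmax(axis=1)]

    # Hypothetical training call: reconstruct the neighbor instead of the input.
    # ae.fit(inputs=ivectors, targets=neighbor_targets(ivectors))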

    Self attention networks in speaker recognition

    Recently, there has been a significant surge of interest in Self-Attention Networks (SANs) based on the Transformer architecture. This can be attributed to their notable ability for parallelization and their impressive performance across various Natural Language Processing applications. On the other hand, the utilization of large-scale, multi-purpose language models trained through self-supervision is progressively more prevalent, for tasks like speech recognition. In this context, the pre-trained model, which has been trained on extensive speech data, can be fine-tuned for particular downstream tasks like speaker verification. These massive models typically rely on SANs as their foundational architecture. Therefore, studying the potential capabilities and training challenges of such models is of utmost importance for the future generation of speaker verification systems. In this direction, we propose a speaker embedding extractor based on SANs to obtain a discriminative speaker representation given non-fixed length speech utterances. With the advancements suggested in this work, we could achieve up to 41% relative performance improvement in terms of EER compared to the naive SAN which was proposed in our previous work. Moreover, we empirically show the training instability in such architectures in terms of rank collapse and further investigate the potential solutions to alleviate this shortcoming. This work was supported by the Spanish Project ADAVOICE PID2019-107579RB-I00 (MICINN).
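
    One way to monitor the rank-collapse issue mentioned above is to track how far the self-attention outputs are from a rank-1 (all-tokens-identical) matrix; the residual-norm diagnostic below is an assumption and not necessarily the measure used in the paper.

    import numpy as np

    def residual_ratio(H):
        # H: (T, D) token representations after a self-attention block. A ratio close
        # to 0 means all tokens have collapsed onto their mean (rank collapse);
        # values near 1 mean the representations remain well spread out.
        res = H - H.mean(axis=0, keepdims=True)
        return np.linalg.norm(res) / np.linalg.norm(H)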

    Language modelling for speaker diarization in telephonic interviews

    The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic data. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of telephone interview audios. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks. This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00 / AEI / 10.13039/501100011033.
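
    A hedged sketch of one iteration of the fusion described above: each word gets a linguistic score from the LSTM classifier and an acoustic score from the GMMs, the two are interpolated, and the word is reassigned to the best-scoring speaker before the models are retrained; the log-linear interpolation and its weight are assumptions.

    import numpy as np

    def relabel_words(ling_logp, acou_logp, w=0.5):
        # ling_logp, acou_logp: (n_words, n_speakers) log-scores from the
        # character-level LSTM classifier and from the GMM acoustic models.
        # Returns a new speaker label for every word; the outer loop (not shown)
        # retrains both models on these labels and repeats until they stabilise.
        fused = w * ling_logp + (1.0 - w) * acou_logp
        return fused.argmax(axis=1)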

    UPC system for the 2016 MediaEval multimodal person discovery in broadcast TV task

    The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are assigned directly the name of the person and used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independently on both the speech and video signals. A simple fusion scheme is used to combine both monomodal annotations into a single one.
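
    A toy sketch of the name-assignment step described above: mono-modal segments that overlap an overlaid name in time are labeled with that name and later act as references for the unassigned segments; the interval-based data structures and function name are assumptions.

    def assign_overlaid_names(segments, overlays):
        # segments: {segment_id: (start, end)} speech or face-track intervals.
        # overlays: {person_name: [(start, end), ...]} on-screen name intervals.
        labels = {}
        for seg_id, (s, e) in segments.items():
            for name, spans in overlays.items():
                if any(s < oe and os_ < e for os_, oe in spans):   # time overlap
                    labels[seg_id] = name
        return labels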

    UPC multimodal speaker diarization system for the 2018 Albayzin challenge

    This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. This approach works by processing the speech and the image signals individually. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet loss DNN that uses i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both the positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities from the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a Deep Neural Network based on the ResNet-34 architecture, trained using a metric learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features computed for each of the frames of that track. Then, this feature vector is compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
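
    A minimal numpy sketch of the regularized triplet objective described above: the standard triplet margin loss plus a term that penalizes the variance of the positive and negative distances across the batch; the margin, the regularization weight and the Euclidean distance are assumptions.

    import numpy as np

    def triplet_loss_with_variance(anchor, pos, neg, margin=0.2, lam=0.1):
        # anchor, pos, neg: (B, D) embedding batches produced by the triplet DNN.
        d_pos = np.linalg.norm(anchor - pos, axis=1)
        d_neg = np.linalg.norm(anchor - neg, axis=1)
        triplet = np.maximum(d_pos - d_neg + margin, 0.0).mean()
        reg = d_pos.var() + d_neg.var()   # variance of positive and negative distances
        return triplet + lam * reg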

    UPC system for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV Task

    This project describes the system developed by UPC to participate in the Multimodal Person Discovery in Broadcast TV task at MediaEval 2015. The main objective of this task is to answer two questions, "Who speaks when?" and "Who appears when?", using any source of information available in a TV broadcast scenario.

    I-vector transformation using k-nearest neighbors for speaker verification

    Probabilistic Linear Discriminant Analysis (PLDA) is the most efficient backend for i-vectors. However, it requires labeled background data, which can be difficult to access in practice. Unlike PLDA, cosine scoring avoids speaker labels at the cost of degraded performance. In this work, we propose a post-processing of i-vectors using a Deep Neural Network (DNN) to transform i-vectors into a new speaker vector representation. The DNN will be trained using i-vectors that are similar to the training i-vectors. These similar i-vectors will be selected in an unsupervised manner. Using the new vector representation, we will score the experimental trials using cosine scoring. The evaluation was performed on the speaker verification trials of the VoxCeleb-1 database. The experiments have shown that, with the help of the similar i-vectors, the new vectors become more discriminative than the original i-vectors. The new vectors have gained a relative improvement of 53% in terms of EER compared to the conventional i-vector/PLDA system, but without using speaker labels. This work has been developed in the framework of the DeepVoice Project (TEC2015-69266-P), funded by the Spanish Ministry.
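
    A hedged sketch of the k-nearest-neighbor target selection suggested by the title, together with the cosine scoring applied to the transformed vectors; the value of k, the averaging of the neighbors and the function names are assumptions.

    import numpy as np

    def knn_targets(ivectors, k=5):
        # For each i-vector, average its k highest-cosine neighbours; the DNN is then
        # trained to map each i-vector onto this neighbourhood target (unsupervised).
        X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
        S = X @ X.T
        np.fill_diagonal(S, -np.inf)
        idx = np.argsort(-S, axis=1)[:, :k]     # indices of the k nearest neighbours
        return ivectors[idx].mean(axis=1)

    def cosine_scores(enroll, test):
        # Cosine scoring of verification trials on the transformed speaker vectors.
        e = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
        t = test / np.linalg.norm(test, axis=1, keepdims=True)
        return np.sum(e * t, axis=1)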